# Multi-Model Violence Labeling Guide

Process your CSV files with multiple LLMs and compare their results.

## Quick Start

### Process a single file with all models:

```bash
python multi_model_labeling.py comments_batch_01.csv
```

This will:
1. Process the file with all 4 models:
   - ibm/granite-3.2-8b
   - google/gemma-3n-e4b
   - qwen3-vl-8b-instruct-mlx
   - qwen/qwen3-vl-4b

2. Create separate output files for each model:
   - `comments_batch_01_ibm_granite_3_2_8b.csv`
   - `comments_batch_01_google_gemma_3n_e4b.csv`
   - `comments_batch_01_qwen3_vl_8b_instruct_mlx.csv`
   - `comments_batch_01_qwen_qwen3_vl_4b.csv`

3. Create a comparison file with all models:
   - `comments_batch_01_ALL_MODELS_COMPARISON.csv`
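The output names above suggest that each model identifier is turned into a filename suffix by replacing punctuation with underscores. The sketch below reproduces that pattern; `model_suffix` and `output_filename` are hypothetical helpers inferred from the example filenames, not functions taken from `multi_model_labeling.py`.

```python
import re
from pathlib import Path


def model_suffix(model_name: str) -> str:
    """Turn a model identifier into a filename-safe suffix."""
    # Assumption: each run of non-alphanumeric characters becomes one underscore.
    return re.sub(r"[^A-Za-z0-9]+", "_", model_name).strip("_")


def output_filename(input_csv: str, model_name: str) -> str:
    """Build the per-model output name for a given input file."""
    stem = Path(input_csv).stem  # "comments_batch_01"
    return f"{stem}_{model_suffix(model_name)}.csv"


print(output_filename("comments_batch_01.csv", "ibm/granite-3.2-8b"))
# prints: comments_batch_01_ibm_granite_3_2_8b.csv
```

This is handy when you need to locate a model's output file or column names from the model identifier in your own analysis code.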

### Specify custom output directory:

```bash
python multi_model_labeling.py comments_batch_01.csv my_results
```

Results will be saved to `my_results/` instead of the default `multi_model_results/`.

## Output Format

### Individual Model Files
Each model gets its own file with columns:
- `comment_id` - Original ID
- `text_original` - Original text
- `discusses_violence_[model_name]` - True/False
- `violence_score_[model_name]` - 0-10 score
- `explanation_[model_name]` - Brief explanation

### Comparison File
Contains ALL columns from all models in one file, so you can easily compare:

```csv
comment_id,text_original,discusses_violence_ibm_granite_3_2_8b,violence_score_ibm_granite_3_2_8b,explanation_ibm_granite_3_2_8b,discusses_violence_google_gemma_3n_e4b,...
"ABC123","Text here...",True,5,"explanation",False,2,"different explanation",...
```

## Summary Statistics

After processing, you'll see a comparison table like:

```
MODEL COMPARISON SUMMARY:
Model                               Violence %     Avg Score
------------------------------------------------------------
ibm/granite-3.2-8b                     23.4%          2.15
google/gemma-3n-e4b                    18.7%          1.82
qwen3-vl-8b-instruct-mlx              31.2%          2.98
qwen/qwen3-vl-4b                       27.5%          2.43
```

This shows which models flag violence more often and score it higher, so you can see how sensitive each model is relative to the others.
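If you want to recompute these numbers yourself, the sketch below derives them from the rows of a comparison file. It assumes that "Violence %" is the share of comments labeled `True` and that "Avg Score" is the mean of the 0-10 scores; the script may compute them differently. `summarize` and the `model_a` suffix are illustrative names, and the sample rows are made up.

```python
import csv
import io


def summarize(rows, model_suffixes):
    """Return {suffix: (violence_percent, average_score)} for each model."""
    summary = {}
    for suffix in model_suffixes:
        flags, scores = [], []
        for row in rows:
            flag = row.get(f"discusses_violence_{suffix}")
            score = row.get(f"violence_score_{suffix}")
            if not flag or not score:
                continue  # skip comments this model did not label
            flags.append(flag.strip().lower() == "true")
            scores.append(float(score))
        if flags:
            summary[suffix] = (100 * sum(flags) / len(flags), sum(scores) / len(scores))
    return summary


# Made-up sample standing in for a comparison CSV
sample = io.StringIO(
    "comment_id,discusses_violence_model_a,violence_score_model_a\n"
    "1,True,6\n"
    "2,False,0\n"
    "3,False,3\n"
    "4,True,7\n"
)
rows = list(csv.DictReader(sample))
for suffix, (pct, avg) in summarize(rows, ["model_a"]).items():
    print(f"{suffix:<20}{pct:>8.1f}%{avg:>10.2f}")
```

To use it on real output, replace `sample` with an open file handle to the comparison CSV and pass the suffixes of the models you ran.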

## Processing Multiple Files

To process all your batch files with all models:

```bash
for file in comments_batch_*.csv; do
    python multi_model_labeling.py "$file"
done
```

Or create a simple bash script:

```bash
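#!/bin/bash
# seq -w zero-pads the counter (01..10); {01..10} only pads in bash 4+,
# not in the bash 3.2 that ships with macOS
for i in $(seq -w 1 10); do
    echo "Processing batch $i..."
    python multi_model_labeling.py "comments_batch_${i}.csv"
done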
```

Save as `process_all.sh`, make executable with `chmod +x process_all.sh`, then run `./process_all.sh`.
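If you would rather stay in Python, the same loop can be sketched with `glob` and `subprocess`. This is a minimal sketch: `build_commands` and `run_all` are hypothetical helpers, and it assumes the script exits with a non-zero status when a file fails, which may not be the case.

```python
import glob
import subprocess
import sys


def build_commands(pattern="comments_batch_*.csv", output_dir=None):
    """Build one command per matching CSV, in sorted order."""
    commands = []
    for path in sorted(glob.glob(pattern)):
        cmd = [sys.executable, "multi_model_labeling.py", path]
        if output_dir:
            cmd.append(output_dir)  # optional second positional argument
        commands.append(cmd)
    return commands


def run_all(pattern="comments_batch_*.csv", output_dir=None):
    """Run every command and return the input files that failed."""
    failed = []
    for cmd in build_commands(pattern, output_dir):
        print("Processing", cmd[2])
        # Assumption: a non-zero exit code signals a failed run
        if subprocess.run(cmd).returncode != 0:
            failed.append(cmd[2])
    return failed


if __name__ == "__main__":
    failures = run_all()
    if failures:
        print("Failed:", ", ".join(failures))
```

Collecting failures instead of stopping at the first one lets a long batch run finish overnight and tells you which files to retry.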

## Customizing Models

To use different models, edit the `MODELS` list in `multi_model_labeling.py` (line 15):

```python
MODELS = [
    "ibm/granite-3.2-8b",
    "google/gemma-3n-e4b",
    "qwen3-vl-8b-instruct-mlx",
    "qwen/qwen3-vl-4b"
]
```

Remove models you don't want or add others you've loaded in LM Studio.

## Time Estimates

- **Per comment per model**: ~1-2 seconds
- **100 comments with 4 models**: ~10-15 minutes
- **1000 comments with 4 models**: ~1.5-2.5 hours

The script processes one model at a time, so total time ≈ comments × models × ~1.5 seconds. For example, 1000 comments × 4 models × 1.5 s = 6000 s, or about 100 minutes.
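The formula is easy to turn into a quick calculator for your own batch sizes. `estimate_minutes` is an illustrative helper, and the 1.5-second default is only the rough midpoint quoted above; actual speed depends on your hardware and the models you load.

```python
def estimate_minutes(n_comments, n_models=4, seconds_per_call=1.5):
    """Rough run time in minutes: comments x models x seconds per call."""
    return n_comments * n_models * seconds_per_call / 60


for n in (100, 1000):
    print(f"{n} comments: ~{estimate_minutes(n):.0f} minutes")
```

Timing a small test file first and passing the measured value as `seconds_per_call` will give a better estimate than the default.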

## Analyzing Results

### Load comparison file in Python:

```python
import pandas as pd

df = pd.read_csv('multi_model_results/comments_batch_01_ALL_MODELS_COMPARISON.csv')

# See agreement between models
model_cols = [col for col in df.columns if 'discusses_violence' in col]
df['agreement_count'] = df[model_cols].sum(axis=1)
df['all_agree_violence'] = df['agreement_count'] == len(model_cols)
df['all_agree_no_violence'] = df['agreement_count'] == 0

print(f"All models agree violence: {df['all_agree_violence'].sum()}")
print(f"All models agree no violence: {df['all_agree_no_violence'].sum()}")
print(f"Models disagree: {(~df['all_agree_violence'] & ~df['all_agree_no_violence']).sum()}")
```

### Calculate inter-model agreement:

```python
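from sklearn.metrics import cohen_kappa_score

# Compare two models (df is the comparison DataFrame loaded above)
model1_col = 'discusses_violence_ibm_granite_3_2_8b'
model2_col = 'discusses_violence_google_gemma_3n_e4b'

# Drop rows where either model is missing a label so both columns stay
# aligned, then cast back to bool (columns with NaN are read as object dtype)
pair = df[[model1_col, model2_col]].dropna().astype(bool)

kappa = cohen_kappa_score(pair[model1_col], pair[model2_col])
print(f"Cohen's Kappa: {kappa:.3f}")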
```
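To compare every pair of models rather than just two, you can loop over all combinations. The sketch below uses a small from-scratch Cohen's kappa so it runs without scikit-learn; it is not the library's implementation, and `cohen_kappa`, `pairwise_kappa`, and the `model_a`/`model_b`/`model_c` labels are illustrative.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of labels."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    # Observed agreement: share of items where both raters gave the same label
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement by chance, from each rater's label frequencies
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:
        return float("nan")  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def pairwise_kappa(labels_by_model):
    """Return {(model_1, model_2): kappa} for every pair of models."""
    return {
        (m1, m2): cohen_kappa(labels_by_model[m1], labels_by_model[m2])
        for m1, m2 in combinations(sorted(labels_by_model), 2)
    }


# Made-up labels for three hypothetical models
example = {
    "model_a": [True, True, False, False],
    "model_b": [True, False, False, False],
    "model_c": [True, True, False, False],
}
for pair, kappa in pairwise_kappa(example).items():
    print(pair, round(kappa, 3))
```

With real data, build `labels_by_model` from the `discusses_violence_*` columns after dropping rows where any model is missing a label, so the lists stay aligned.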

## Tips

1. **Start small**: Test with one small file first
2. **Check agreement**: High disagreement between models may indicate ambiguous cases
3. **Average scores**: Consider averaging the violence scores across models
4. **Qualitative check**: Review cases where models disagree most
5. **Model selection**: If one model consistently matches your own manual labels best, use only that one
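Tips 3 and 4 can be combined in one pass over the comparison file: average the scores per comment, and rank comments by how far the models' scores are spread apart. This is a sketch under the assumption that every score column starts with `violence_score_`; `score_spread` and the sample rows are illustrative.

```python
import csv
import io
from statistics import mean


def score_spread(rows, prefix="violence_score_"):
    """Return per-comment mean score and max-min spread, widest spread first."""
    results = []
    for row in rows:
        scores = [
            float(value)
            for key, value in row.items()
            if key and key.startswith(prefix) and value not in ("", None)
        ]
        if not scores:
            continue  # no model scored this comment
        results.append({
            "comment_id": row["comment_id"],
            "mean_score": mean(scores),
            "spread": max(scores) - min(scores),
        })
    return sorted(results, key=lambda r: r["spread"], reverse=True)


# Made-up sample standing in for a comparison CSV
sample = io.StringIO(
    "comment_id,violence_score_model_a,violence_score_model_b\n"
    "1,3,3\n"
    "2,2,8\n"
)
ranked = score_spread(list(csv.DictReader(sample)))
for item in ranked:
    print(item["comment_id"], item["mean_score"], item["spread"])
```

The comments at the top of the ranking are the ones most worth reading by hand.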

## Troubleshooting

**Error: "Model not found"**
- Make sure every model in the `MODELS` list is downloaded in LM Studio
- The models do not all need to be loaded at once; only one is active at a time
- LM Studio will load each model automatically when the script calls it

**Different models give very different results**
- This is normal! The models were trained differently, so they judge borderline cases differently
- Use this as part of your analysis
- Consider reporting results from multiple models

**Process interrupted**
- Individual model files are saved as they complete
- You can resume by removing completed models from the `MODELS` list
- Or just re-run the script; it will overwrite the existing output files
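
To work out which models still need to run after an interruption, you can check which per-model output files already exist. This sketch assumes the filename pattern shown in Quick Start and that a file on disk means the model finished; `pending_models` is a hypothetical helper, not part of the script.

```python
import re
from pathlib import Path

MODELS = [
    "ibm/granite-3.2-8b",
    "google/gemma-3n-e4b",
    "qwen3-vl-8b-instruct-mlx",
    "qwen/qwen3-vl-4b",
]


def pending_models(input_csv, output_dir="multi_model_results", models=MODELS):
    """Return the models that have no output file yet for this input."""
    stem = Path(input_csv).stem
    pending = []
    for model in models:
        # Assumption: punctuation in the model name becomes underscores
        suffix = re.sub(r"[^A-Za-z0-9]+", "_", model)
        if not (Path(output_dir) / f"{stem}_{suffix}.csv").exists():
            pending.append(model)
    return pending


print(pending_models("comments_batch_01.csv"))
```

Paste the printed list into `MODELS` before re-running to skip the models that already completed.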
